Abstract. We study the problem of finding matches in a stream with unlabelled
data. Where the data are not labelled, the only information we have is which items
are the same and which differ. A pattern P of length m is said to match a substring
of the stream T at position i if there is an injective (one-to-one) function f such
that $T[i+j] = f(P[j])$ for all $0 \le j < m$. Such a mapping corresponds to a
labelling or relabelling of the symbols in the input and may be distinct for each
alignment of the pattern and streaming text. This problem, which is also known as
parameterised matching, has applications ranging from plagiarism detection in
computer code to searching within cryptograms. We present both randomised and
deterministic solutions. Our deterministic solution requires $O(|\Sigma| + p)$ words
of space, where $|\Sigma|$ is the number of distinct characters in the pattern and p
is the parameterised period of the pattern. Our randomised solution improves the
space requirements to $O(|\Sigma| \log m)$ words and is necessarily more sophisticated
in its approach. Both algorithms take $O(\sqrt{\log |\Sigma| / \log\log |\Sigma|})$ time
per new arriving symbol in the worst case. Our randomised algorithm finds all
matches with high probability and we show that both space and time requirements
are optimal up to logarithmic factors.



1 Introduction 

We consider the problem of pattern matching in a stream with unlabelled data. In 
this setting the only information we have about the streaming symbols is which are 
the same and which differ. The search problem, which is also known as parameterised 
matching in offline settings, has at its origin the problem of finding duplication and 
plagiarism in software code, although it has since found a number of other applications.
Since the first introduction of parameterised matching in an algorithmic setting, a 
great deal of work has gone into its study in both theoretical and practical settings 
(see e.g. P, |3-0j HI)- Perhaps the most basic relevant property of parameterised 



matching is that in an off-line setting, the exact parameterised matching problem 
can be solved in near linear time using a variant [1] of the classic linear time exact 
matching algorithm KMP [13]. 

In our streaming setting, the pattern or query is known in advance and the 
symbols of the stream arrive one at a time. Our task is to output if there is a match 
between the pattern and the latest suffix of the stream as soon as a new symbol 
arrives, where in this case the symbols of the stream are unlabelled. More formally, 
the pattern P of length m is said to match a substring of the stream T at position i 
if there is an injective (one-to-one) function f such that $T[i+j] = f(P[j])$ for all
$0 \le j < m$. Such a mapping corresponds to a labelling or relabelling of the symbols and
may be distinct for each alignment of the pattern and streaming text. The matching 
problem can be viewed in a number of practical ways but perhaps the simplest is to 
consider the task as that of finding matches in a stream encrypted using different
substitution ciphers. To give a small example, the pattern aba matches string xyxyx
at all three alignments but only matches string xxxyx at the final alignment.

Our interest is in tackling the parameterised matching problem in a streaming 
setting using minimal space and with guaranteed worst case running time. The 
field of pattern matching in a stream took a significant step forwards in 2009 when 
it was shown to be possible to solve (non-parameterised) exact matching using 
only O(log m) words of space and O(log m) time per new stream symbol [15]. This
method correctly finds all matches with high probability. The initial approach was 
subsequently somewhat simplified ^ and then finally improved to run in constant 
time [8] within the same space requirements. Our results provide the first
demonstration that near optimal space and near constant time is achievable for a more
challenging problem. 

For these previous exact matching methods to work, properties of the periods 
of strings form a crucial part of their analysis. However, when considering parame- 
terised matching the period of a string is a much less straightforward concept than 
it is for exact matching. For example, it is no longer true that consecutive matches 
must either be separated by the period of the pattern or be at least m/2 symbols
apart. This property, which holds for exact matching but not in the parameterised
case, allows for an efficient encoding of the positions of the matches and is crucial
to reducing the space requirements of the previous streaming algorithms.
Unfortunately parameterised matches can occur at arbitrary positions in
the stream, requiring us to find new ways of reducing the storage space used.

This is however not the only challenge that needs to be tackled. A natural 
way to match two strings under parameterisation is to consider their predecessor 
strings. For a string T, the predecessor string pred(T) is a string of length |T| with
the property that pred(T)[i] is the distance counted in numbers of symbols to the
previous occurrence of the symbol T[i] in T. In other words, pred(T)[i] = d, where
d is the smallest positive value for which T[i] = T[i-d]. Whenever no such
d exists, we set pred(T)[i] = 0. As an example, if T = aababcca then pred(T) =
01022014. We can now perform parameterised matching offline by only considering
predecessor strings using the fundamental fact that two equal length strings S and
S' have a parameterised match if and only if pred(S) = pred(S'). A plausible
approach to solving the streaming problem would now be to translate the problem
of parameterised matching in a stream to that of exact matching. This could be 
achieved by converting both pattern and stream into their corresponding predecessor 
strings and maintaining fingerprints of a sliding window of the translated input. 
However, consider the effect on the predecessor string, and hence its fingerprint, of 
sliding a window in the stream along by one. The leftmost symbol x, say, will move
out of the window and so the predecessor value of the new leftmost occurrence of
x in the new window will need to be set to 0 and the corresponding fingerprint
updated. We cannot however afford to store the positions of all characters in even
a single window of the text as this will take O(m) space.
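
The predecessor-string characterisation above can be illustrated offline with a few lines of Python. This is only a naive sketch (the function names `pred` and `p_matches` are ours, and it recomputes the predecessor string of every window from scratch, so it uses O(m) space and O(m) time per alignment); it is not the streaming algorithm developed in this paper.

```python
def pred(s):
    """Predecessor string: distance to the previous occurrence of each symbol, or 0."""
    last_seen = {}
    out = []
    for i, symbol in enumerate(s):
        out.append(i - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = i
    return out


def p_matches(pattern, text):
    """All alignments i at which pattern has a parameterised match with text[i:i+m]."""
    m = len(pattern)
    target = pred(pattern)
    # Two equal length strings p-match iff their predecessor strings are equal.
    return [i for i in range(len(text) - m + 1) if pred(text[i:i + m]) == target]
```

Note that `pred` is recomputed on each window rather than sliced from the predecessor string of the whole text; this is exactly the step that is expensive to maintain in a stream, since values pointing back beyond the window start must be reset to 0.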

We will show a matching algorithm that solves these problems and others we 
encounter en route in near constant time per arriving symbol and minimal space. It 
turns out that achieving the desired space bound without regard to running time is, 
although by no means trivial, still relatively straightforward compared to the prob- 
lems encountered tackling both time and space simultaneously. To achieve our final 
goal the solution we give de-amortises the entire matching process, spreading the 
work across the time taken by incoming symbols. A number of technical innovations 
are now required as a result of this de-amortisation. These will include, amongst
others, new uses of fingerprinting methods, compressed encodings, a separate
deterministic algorithm designed for prefixes of the pattern with small parameterised
period as well as a careful scheduling of work to ensure that preliminary answers 
are computed in time to output matches as soon as they occur. 

2 Our new results 

Our main result is a fast and space efficient algorithm to solve the problem of finding 
matches in an unlabelled stream. 

Theorem 1. There is a randomised algorithm that finds matches in an unlabelled
stream and runs in $O(\sqrt{\log |\Sigma| / \log\log |\Sigma|})$ time in the worst case per arriving
symbol and $O(|\Sigma| \log m)$ words of space, where $|\Sigma|$ is the number of distinct symbols
in the pattern. The probability that the algorithm outputs correctly at all alignments
of an n length text is at least $1 - 1/n^c$, where c is any constant.

The running time is therefore near optimal. The full running time is in fact 
dominated by the complexity of the operations insert, delete and lookup in a
dynamic dictionary containing $|\Sigma|$ distinct elements. Using a worst case variant of
exponential search trees [3] we achieve the quoted time complexity but any
suitable dictionary data structure can be substituted without change to the overall
algorithm. 

We also give a separate and somewhat simpler deterministic solution which uses
$O(|\Sigma| + p)$ words of space, where p is the parameterised period (q.v. Section 2.1) of
the pattern. The time complexity matches that of our randomised solution. We use
it as a special case for our main randomised algorithm but it may be of independent 
interest in cases where the pattern has small parameterised period. 

Theorem 2. There is a deterministic algorithm that finds matches in an unlabelled
stream and runs in $O(\sqrt{\log |\Sigma| / \log\log |\Sigma|})$ time in the worst case per arriving
symbol and $O(p + |\Sigma|)$ words of space, where p is the parameterised period of P.

To complete the picture we give nearly matching space lower bounds which show 
that our solutions are optimal to within log factors. The proof is by a relatively 
straightforward communication complexity argument. In essence one can show that 
in the randomised case Alice is able to transmit a complete string of length $\Omega(|\Sigma|)$
bits to Bob using a solution to the matching problem by choosing a suitably crafted
pattern and streaming text. Similarly in the deterministic case one can show that she
can send a bit string of length $\Omega(|\Sigma| + p)$ bits. The proof is deferred to the appendix.

Theorem 3. There is a randomised space lower bound for the problem of finding
matches in an unlabelled stream of $\Omega(|\Sigma|)$ bits. There is also a deterministic space
lower bound of $\Omega(|\Sigma| + p)$ bits for the same problem.

2.1 Facts, notation and definitions 

We use the term p-match for a parameterised match and define the parameterised
period (p-period) of a string P as the smallest $p > 0$ such that $P[0 \ldots (m - 1 - p)]$
p-matches $P[p \ldots (m - 1)]$.

We will make extensive use of Rabin-Karp style fingerprints of strings, which we
define as follows. Let $p > |\Sigma|$ be a prime and choose $r \in \mathbb{Z}_p$ uniformly at random.
For a string S, the fingerprint $\phi(S)$ is given by

$$\phi(S) = \sum_{k=0}^{|S|-1} S[k] \, r^k \bmod p.$$

A critical property of the fingerprint function $\phi$ is that the probability of
achieving a false positive is small, that is,
$\Pr(\phi(S) = \phi(S') \wedge S \neq S') \le |S|/(p - 1)$ (see [12, 15] for proofs). As we
assume the RAM model with word size $\Theta(\log n)$, where n is the total length of the
stream, we can therefore choose $p \in \Theta(n^c)$ for any constant c, giving a false
positive probability asymptotically no more than $1/n^{c-1}$. In particular, as our
randomised algorithm will make $O(n \log n)$ fingerprint comparisons in total, we can
instead choose p so that by the union bound, there are no false positives with the
same probability. We assume that all fingerprint arithmetic is performed within
$\mathbb{Z}_p$, in particular when subtracting one fingerprint from another, an operation
we will need to do repeatedly. We will also take advantage of the following
properties of fingerprints.

Fact 1. Splitting: given $\phi(S[a \ldots c])$ and $\phi(S[a \ldots b])$, the value of $\phi(S[b+1 \ldots c])$
can be computed in O(1) time. Updating: the fingerprint of a string $\phi(S[a \ldots c])$
can be updated in O(1) time given some index $j \in \{a, \ldots, c\}$ and a new value S[j]
for that position in the string. We will focus on setting certain values to zero.
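
The two properties in Fact 1 can be sketched in Python as follows. This is an illustrative sketch under our own assumptions: the modulus `P_MOD` is an arbitrary fixed prime, the helper names are hypothetical, and powers of r are recomputed with `pow` for brevity (to obtain O(1) time the required powers and inverse powers of r would have to be available already). The splitting step uses a modular inverse and therefore assumes r is non-zero; it needs Python 3.8 or later.

```python
P_MOD = (1 << 61) - 1  # a Mersenne prime, chosen arbitrarily for this sketch


def fingerprint(values, r, p=P_MOD):
    """phi(S) = sum_k S[k] * r^k mod p for a sequence of non-negative integers."""
    acc, power = 0, 1
    for v in values:
        acc = (acc + v * power) % p
        power = (power * r) % p
    return acc


def split(fp_ac, fp_ab, len_ab, r, p=P_MOD):
    """phi(S[b+1..c]) from phi(S[a..c]) and phi(S[a..b]), where len_ab = b - a + 1."""
    shift = pow(r, -len_ab, p)  # inverse of r^len_ab modulo p (requires r != 0)
    return ((fp_ac - fp_ab) * shift) % p


def update(fp, offset, old, new, r, p=P_MOD):
    """Fingerprint after changing the value at the given offset from old to new."""
    return (fp + (new - old) * pow(r, offset, p)) % p
```

In practice r would be drawn uniformly at random; a fixed value is convenient when checking the identities by hand.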

The main algorithm we present will try to match the streaming text with various 
prefixes of the pattern P. We define them along with some associated variables in 
the following definition. 

Definition 1. Let $\delta = |\Sigma| \log m$ and let $P_0$ be the shortest prefix of P that has
p-period greater than $3\delta$. We define s prefixes $P_\ell$ of increasing length so that
$|P_\ell| = 2^\ell |P_0|$ for $\ell \in \{1, \ldots, s - 1\}$, where s is the largest value such that
$|P_{s-1}| \le m/2$. The final prefix $P_s$ has length $m - 4\delta$. For all $\ell$, let
$m_\ell = |P_\ell|$ denote the length of $P_\ell$.

From Definition 1 it follows that $s \in O(\log m)$. For our randomised algorithm
in Section 3 we will assume that $m > 14\delta$ to ensure that $m_\ell - m_{\ell-1} \ge 3\delta$ for
$\ell \in \{1, \ldots, s\}$. If $m \le 14\delta$, or the p-period of P is $3\delta$ or less, we instead apply
the deterministic algorithm of Section 4 to solve the problem within the required
bounds.
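
As a small illustration of Definition 1, the prefix lengths can be listed once the length of the shortest prefix with large p-period and the value of the delay parameter are known. The sketch below takes both as inputs (the function name and the parameters `m0` and `delta` are ours); computing the first prefix itself requires p-periods and is not shown.

```python
def prefix_lengths(m, m0, delta):
    """Lengths m_0, ..., m_s of the pattern prefixes from Definition 1.

    m0 is the length of the shortest prefix with p-period greater than 3*delta;
    lengths double while they stay at most m/2, and the final prefix has
    length m - 4*delta.
    """
    lengths = [m0]
    while 2 * lengths[-1] <= m / 2:
        lengths.append(2 * lengths[-1])
    lengths.append(m - 4 * delta)
    return lengths
```

For example, with m = 1000, m0 = 16 and delta = 5 this gives the lengths 16, 32, 64, 128, 256 and 980, so consecutive lengths differ by at least 3 * delta = 15.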

In order to determine if there is a p-match between the text and a pattern prefix,
we will compare fingerprints of the various prefixes of the pattern with fingerprints
of the streaming text. We will need three different fingerprint definitions to achieve
this (see Fig. 1) as well as a difference fingerprint.

Definition 2. For any index i' and level $\ell$,
$\Phi_\ell(i') = \phi(\mathrm{pred}(T[0 \ldots (i' + m_\ell - 1)]))$,
$\widehat{\Phi}_\ell(i') = \phi(\mathrm{pred}(T[0 \ldots (i' + m_\ell - 1)])[(i' + m_{\ell-1}) \ldots (i' + m_\ell - 1)])$ and
$\widetilde{\Phi}_\ell(i') = \phi(\mathrm{pred}(T[i' \ldots (i' + m_\ell - 1)])[(i' + m_{\ell-1}) \ldots (i' + m_\ell - 1)])$.
The fingerprint difference is $\Delta_\ell(i') = \widehat{\Phi}_\ell(i') - \widetilde{\Phi}_\ell(i')$.

In less formal terms, with reference to Fig. 1, $\Phi_\ell(i')$ is the fingerprint of the
predecessor string of the whole text up to index $i' + m_\ell - 1$, $\widehat{\Phi}_\ell(i')$ is the
fingerprint of the suffix, starting at index $i' + m_{\ell-1}$, of the predecessor string of
the whole text up to index $i' + m_\ell - 1$, and $\widetilde{\Phi}_\ell(i')$ is similar to
$\widehat{\Phi}_\ell(i')$ only that the predecessor string starts at index i' instead of at the
very beginning of the text. Finally the fingerprint difference $\Delta_\ell(i')$ captures
the contribution from the positions of the predecessor string of the text that point
back beyond position i'.

Fig. 1. The three key fingerprints $\Phi$, $\widehat{\Phi}$ and $\widetilde{\Phi}$.

As we will see in the next section, all these rather intricate fingerprint definitions
form the foundation of our algorithm. We first give a quick example of how our
algorithm will take advantage of the properties of fingerprints. 

Example 1. Assuming that $P_{\ell-1}$ p-matches the text at position i' (see Fig. 1 again),
our algorithm will work out if the match at i' can be extended to $P_\ell$ by computing
$\widetilde{\Phi}_\ell(i')$. Namely, if $\widetilde{\Phi}_\ell(i') = \phi(\mathrm{pred}(P_\ell)[m_{\ell-1} \ldots (m_\ell - 1)])$ then by the
splitting property of Fact 1, $P_\ell$ p-matches T at position i'. Similarly, if $P_\ell$ does
not p-match T at position i' then with high probability,
$\widetilde{\Phi}_\ell(i') \neq \phi(\mathrm{pred}(P_\ell)[m_{\ell-1} \ldots (m_\ell - 1)])$. Our algorithm will compute
$\widetilde{\Phi}_\ell(i')$ by computing $\widehat{\Phi}_\ell(i')$ and $\Delta_\ell(i')$ and making use
of the updating property of Fact 1.
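
The relationship between the fingerprints in Example 1 can be checked offline with the following sketch. It is only an illustration under our own assumptions: it stores the whole predecessor string of the text (which the streaming algorithm cannot afford), it weights position k of the text by r^k, and it therefore scales the pattern fingerprint by r^i before comparing. The names `extends`, `m_prev` and `m_cur` are ours; `m_prev` and `m_cur` stand for the lengths of two consecutive prefixes.

```python
P_MOD = (1 << 61) - 1  # an arbitrary prime modulus for this sketch


def pred(s):
    last_seen, out = {}, []
    for i, symbol in enumerate(s):
        out.append(i - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = i
    return out


def extends(text, pattern, i, m_prev, m_cur, r):
    """Given a p-match of pattern[:m_prev] at text position i, test pattern[:m_cur]."""
    pt, pp = pred(text), pred(pattern)
    # Prefix fingerprints of pred(text): F[k] = sum_{j<k} pred(text)[j] * r^j.
    F = [0]
    for j, v in enumerate(pt[:i + m_cur]):
        F.append((F[-1] + v * pow(r, j, P_MOD)) % P_MOD)
    # Fingerprint of the last m_cur - m_prev values, taken from the whole-text pred.
    hat = (F[i + m_cur] - F[i + m_prev]) % P_MOD
    # Contribution of values that point back beyond position i (to be zeroed).
    delta = sum(pt[k] * pow(r, k, P_MOD)
                for k in range(i + m_prev, i + m_cur) if pt[k] > k - i) % P_MOD
    tilde = (hat - delta) % P_MOD
    # Pattern fingerprint over the same range, shifted to text position i.
    target = sum(pp[j] * pow(r, i + j, P_MOD) for j in range(m_prev, m_cur)) % P_MOD
    return tilde == target
```

With text cxyxc and pattern abac, the prefix ab p-matches at positions 0 and 1, but only the match at position 1 extends to the whole pattern; there the final c has a predecessor value pointing back beyond the window, which is exactly the contribution removed by the difference term.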



3 The main matching algorithm 

Our solution works by finding matches within the stream of the pattern prefixes
$P_0, \ldots, P_s$ defined in the previous section, using the observation that if a shorter
prefix fails to match at a given position then there is no need to check matches for
longer prefixes. Only if all prefixes match at a particular position do we check whether
the whole of P also matches. To find if a pattern prefix matches, we maintain suitable
fingerprints of the streaming text that we update as new symbols arrive and use them to
compare to the fingerprints of the pattern prefixes in a similar fashion to Example 1.
This overall description also matches that of previous work on exact matching in
a stream. However, as will become clear, in our case a considerable amount of work
is required to simultaneously minimise the space and time requirements.

At any given moment in time our algorithm runs three different processes which
we label P1, P2 and P3. Each process takes O(1) time per arriving symbol once
the small-cost alphabet size reduction described in Section 3.4 has been performed.
Under this reduction, we may assume that the input alphabet is of the form
$\Sigma = \{0, 1, 2, \ldots, |\Sigma| - 1\}$. It is under this assumption that our algorithm operates.
Before describing the processes in more detail, we give a brief overview below. Supporting
lemmas for running time, space bounds and correctness are provided in Section 5
with proofs deferred to the appendices.

Process P1 is responsible for finding matches with prefix $P_0$ only. To do this
it calls a separate deterministic algorithm that we describe in Section 4. When a
matching position i' is found, it will be stored together with the fingerprint $\Phi_0(i')$
in a queue called $M_0$ so that process P2 can use it to check if it also matches the
prefix $P_1$. Total space usage for process P1 is $O(|\Sigma| \log m)$.

Process P2 is responsible for finding matches for all prefixes $P_\ell$ with $\ell \ge 1$,
but not for matching the whole pattern which is the responsibility of process P3.
Process P2 is randomised and outputs potential matches up to $3\delta \in O(|\Sigma| \log m)$
symbol arrivals after they occur. The delay is a consequence of our de-amortisation
by spreading the work out over arriving symbols.

Process P2 runs two subprocesses labelled P2a and P2b. The first subprocess
does bookkeeping of positions of the text whose predecessor values have to be set to
zero when taking the fingerprint of substrings of the text. To do this, s + 1 queues
are used: $\mathcal{D}$ and $D_\ell$ for each $\ell \in \{1, \ldots, s\}$. Each queue has size $O(|\Sigma|)$ (which for
$\mathcal{D}$ requires a proof deferred to Section 5).

Subprocess P2b establishes matches for each prefix $P_\ell$. Whenever prefix $P_\ell$ is
considered, the subprocess tries to determine if some match with $P_{\ell-1}$ can be
extended to a match with $P_\ell$. Matches with $P_{\ell-1}$ are retrieved from a queue $M_{\ell-1}$ and a
newly found match with $P_\ell$ at position i' is added to a queue $M_\ell$, together with
the fingerprint $\Phi_\ell(i')$. We show in Section 5 that each such queue can be encoded
in a compressed form using only $O(|\Sigma|)$ space. The subprocess P2b establishes a
match at position i' by first computing the fingerprints $\widehat{\Phi}_\ell(i')$ and $\Delta_\ell(i')$. The
former is derived from the fingerprint $\Phi_{\ell-1}(i')$ stored with i' in $M_{\ell-1}$, and the latter
is derived from the queue $D_\ell$. Together the two fingerprints give $\widetilde{\Phi}_\ell(i')$, which is
used to determine a match (see Example 1). Total space usage for process P2 is
$O(|\Sigma| \log m)$.

Process P3 is responsible for finding matches of the whole pattern P and hence
outputting the final answers in constant time. Whenever process P2 reports a match
for $P_s$, which could occur with up to $3\delta$ delay, process P3 will naively work out if
the match can be extended to the whole pattern P. This can be done fast enough as
there are only $4\delta$ characters to check over the next $\delta$ arriving symbols (recall that
$|P_s| = m - 4\delta$). Space usage for process P3 is $O(|\Sigma| \log m)$ as $O(\delta)$ symbols need
to be maintained.



3.1 Process P1 ($\ell = 0$)

Process P1 finds matches in the stream with the pattern prefix $P_0$. From the
definition of $P_0$ we have that if we remove the final character from it (giving the string
$P[0 \ldots m_0 - 2]$) then its p-period is at most $3\delta$. Recall that $\delta = |\Sigma| \log m$. The
p-period of $P_0$ itself could be much larger. As part of process P1 we run a
deterministic pattern matching algorithm on $P[0 \ldots m_0 - 2]$ that returns p-matches with
the stream in real-time and (for our pattern) uses $O(|\Sigma| \log m)$ space. This
algorithm, whose space complexity depends on the p-period of the pattern, is described
separately in Section 4 and is used here as a stand-alone subroutine.

In order to establish matches with the whole of $P_0$ we handle the final character
separately. If the deterministic subroutine reports a match that ends in T[i - 1], when
T[i] arrives we have a p-match with $P_0$ if and only if $\mathrm{pred}(T)[i] = \mathrm{pred}(P_0)[m_0 - 1]$
(or $\mathrm{pred}(T)[i] \ge m_0$ if $\mathrm{pred}(P_0)[m_0 - 1] = 0$).

Whenever process P1 finds a match with $P_0$ at position i' of the text, the pair
$(i', \Phi_0(i'))$ is added to a (FIFO) queue named $M_0$, which is queried by P2.

Fig. 2. $P_{\ell-1}$ and $P_\ell$ both p-match T at position i'. The p-match with $P_{\ell-1}$
is added to $M_{\ell-1}$ during interval A. Subprocess P2a ensures that by the
end of interval B, $D_\ell$ contains the required elements to compute $\Delta_\ell(i')$.
Subprocess P2b finds the p-match with $P_\ell$ during interval C.



3.2 Process P2 ($\ell > 0$)

Process P2 finds matches in the stream with the pattern prefixes $P_1, \ldots, P_s$. We
split the discussion of its execution into levels and say that level $\ell$ corresponds to
process P2 as it is looking for matches with prefix $P_\ell$. Each level is responsible for
reporting the matches that occur between its prefix and the different positions in
the stream and this information is then used by the subsequent level.

Process P2 computes for each level $\ell \ge 1$ the fingerprints $\widehat{\Phi}_\ell(i')$ and $\Delta_\ell(i')$ for
each position i' at which $P_{\ell-1}$ p-matches the text. Then, as set out in Example 1,
if $\widehat{\Phi}_\ell(i') - \Delta_\ell(i')$ (i.e., $\widetilde{\Phi}_\ell(i')$) equals $\phi(\mathrm{pred}(P_\ell)[m_{\ell-1} \ldots (m_\ell - 1)])$, there is
also a match with $P_\ell$ at i'. The algorithm will in this case add the pair $(i', \Phi_\ell(i'))$ to a
queue $M_\ell$, which is subject to queries by level $\ell + 1$.

In order for Process P2 to spend only constant time per arriving symbol, all its
work must be scheduled carefully. The preparation of the $\Delta_\ell$ values takes place as a
subprocess we name P2a. Computing the $\widetilde{\Phi}_\ell$ values and establishing matches takes
place in another subprocess named P2b. The two subprocesses are run in sequence
for each arriving symbol. Fig. 2 provides a diagrammatic description of the overall
running of P2. We now give details of the two subprocesses.

Subprocess P2a (preparing the $\Delta_\ell$ values) There is a queue $D_\ell$ associated
with each level $\ell \ge 1$. These queues hold positions of the streaming text whose
predecessor value points back far enough to be changed to 0 for our purposes.

For each arriving symbol T[i] we check whether $\mathrm{pred}(T)[i] > m_0$. If so, the pair
(i, pred(T)[i]) is added to a (FIFO) queue called $\mathcal{D}$, to be dealt with later.

If the subprocess P2a is currently not in the state of processing an element from
the queue $\mathcal{D}$, it will now remove an element from $\mathcal{D}$ (unless $\mathcal{D}$ is empty). Call this
element (i', pred(T)[i']). Over the next s arriving symbols, the subprocess P2a will
process this element as follows. For each of the s levels $\ell \ge 1$, if $\mathrm{pred}(T)[i'] > m_{\ell-1}$,
add (i', pred(T)[i']) to the queue $D_\ell$. If $D_\ell$ contains more than $12|\Sigma|$ elements,
discard the oldest element. As we will see shortly, the subprocess P2b will use the
information from the $D_\ell$ queues to work out the $\Delta_\ell$ values when needed.
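
The bookkeeping performed by subprocess P2a can be sketched as follows. This is a minimal illustration under our own assumptions rather than the exact schedule used in the analysis: the class and attribute names are hypothetical, one level is handled per arriving symbol so that an element is fully processed after s arrivals, and the bounded queues are realised with Python deques whose `maxlen` discards the oldest element automatically.

```python
from collections import deque


class P2aSketch:
    """Spreads the distribution of (position, predecessor value) pairs over s arrivals."""

    def __init__(self, prefix_lengths, sigma):
        self.m = prefix_lengths                  # m[0], ..., m[s]
        self.s = len(prefix_lengths) - 1
        self.pending = deque()                   # the first FIFO queue of pairs
        # One bounded queue per level 1..s, holding at most 12 * sigma pairs each.
        self.level_queues = {l: deque(maxlen=12 * sigma) for l in range(1, self.s + 1)}
        self.current = None                      # pair currently being processed
        self.next_level = 1

    def arrive(self, i, pred_i):
        """Called once per arriving symbol with its index and predecessor value."""
        if pred_i > self.m[0]:
            self.pending.append((i, pred_i))
        if self.current is None and self.pending:
            self.current = self.pending.popleft()
            self.next_level = 1
        if self.current is not None:
            pos, back = self.current
            level = self.next_level
            if back > self.m[level - 1]:
                self.level_queues[level].append((pos, back))
            self.next_level += 1
            if self.next_level > self.s:
                self.current = None              # done after s arrivals
```

Each call does a constant amount of work, which is the point of spreading the s level checks across s arrivals.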

Subprocess P2b (finding matches for all prefixes) This subprocess schedules
the work across the levels in a round robin fashion by only considering level
$\ell = 1 + (i \bmod s)$ when the symbol T[i] arrives. Potential matches may not be reported
by this subprocess until $O(|\Sigma| \log m)$ arriving symbols after they occur.

The subprocess P2b for level $\ell$ is always in one of two states. Either it is checking
whether a matching position i' for $P_{\ell-1}$ can be extended to a match with $P_\ell$, or
it is idle, waiting to check some reported match with $P_{\ell-1}$. In the latter state,
level $\ell$ looks into queue $M_{\ell-1}$, which contains matches with $P_{\ell-1}$. Whenever $M_{\ell-1}$
becomes or already is non-empty, level $\ell$ removes an element from $M_{\ell-1}$. Call this
element $(i', \Phi_{\ell-1}(i'))$. When $i > i' + m_\ell + \delta$ (interval C in Fig. 2), level $\ell$ will start
checking if i' is also a matching position with $P_\ell$. It does so by first computing the
fingerprint $\widehat{\Phi}_\ell(i')$, which from the definition equals $\Phi_\ell(i') - \Phi_{\ell-1}(i')$. We can ensure
the fingerprint $\Phi_\ell(i')$ is always available when needed by maintaining a circular
buffer of the most recent $O(|\Sigma| \log m)$ fingerprints of the text.

Over the next at most $|\Sigma|$ arriving symbols for which P2b is considering level
$\ell$, the subprocess P2b will compute $\Delta_\ell(i')$ by stepping through the elements of the
queue $D_\ell$, aggregating those elements that contribute to the value of $\Delta_\ell(i')$. An
explicit formula for this calculation is given in Lemma 5 in Section 5. Once this is
done, the fingerprint $\widetilde{\Phi}_\ell(i') = \widehat{\Phi}_\ell(i') - \Delta_\ell(i')$ can be computed and compared to
$\phi(\mathrm{pred}(P_\ell)[m_{\ell-1} \ldots (m_\ell - 1)])$ as in Example 1. If the values are the same, we have
a p-match with $P_\ell$ at position i' of the text, and the pair $(i', \Phi_\ell(i'))$ is added to the
queue $M_\ell$. This occurs before text index $i' + m_\ell + 3\delta$ arrives, i.e. before the end of
interval C in Fig. 2. Correctness and time/space bounds of P2a and P2b are given
in Section 5.

3.3 Process P3 (matching the full pattern) 

Levels greater than 0 output the position of matches of various prefixes of the
pattern with some delay. In order to report matches of the full pattern as soon as a
new stream symbol arrives we need one further stage for our matching algorithm.
We have that P2 outputs any p-match between $P_s = P[0 \ldots (m - 4\delta - 1)]$ and
T at most $3\delta$ arrivals after it occurs, i.e. at least $\delta$ arrivals before a full p-match
with P occurs. Such $\delta$ length gaps cannot overlap as $P_s$ has p-period at least $3\delta$.
This gives us $\delta$ arrivals to directly compare $\mathrm{pred}(P)[(m - 4\delta) \ldots (m - 1)]$ with
$\mathrm{pred}(T)[(i' + m - 4\delta) \ldots (i' + m - 1)]$, which we obtain by buffering pred(T). By
inspection of the definition of predecessor strings, this is sufficient to determine
whether a full p-match with P occurs. The extra space used is dominated by the
need to buffer the last $O(\delta) = O(|\Sigma| \log m)$ values from pred(T) and the time is
O(1) per character.

3.4 Coping with larger alphabets 

The methods we have described require the input alphabet to be in the range
$\{0, \ldots, |\Sigma| - 1\}$. In order to handle larger alphabets we map the input alphabet to this
range as the stream symbols arrive using a dynamic dictionary. Let T' denote the
stream after this mapping has been applied. We construct a mapping such that for
all i, $\mathrm{pred}(T[i - m + 1 \ldots i]) = \mathrm{pred}(T'[i - m + 1 \ldots i])$ if $T[i - m + 1 \ldots i]$ contains
$|\Sigma|$ or fewer distinct symbols. If $T[i - m + 1 \ldots i]$ contains more than $|\Sigma|$ distinct
symbols then there cannot be a p-match. Using a dictionary based on exponential
search trees, Lemma 1 below gives us the desired result. It is in fact only this
mapping process, required for larger input alphabets, which prevents the algorithms
we present from running in real-time.

Lemma 1. There is a mapping that reduces the input alphabet to be in the range
$\{0, \ldots, |\Sigma| - 1\}$, runs in $O(\sqrt{\log |\Sigma| / \log\log |\Sigma|})$ time per arriving character and uses
$O(|\Sigma|)$ space, where $|\Sigma|$ is the number of distinct symbols in the pattern. This
mapping preserves the locations of the p-matches in the text stream.
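
One plausible way to realise such a mapping, which we sketch here purely as an illustration and not as the construction behind Lemma 1, is to hand out the small identifiers on a least-recently-seen basis. The class name is hypothetical and the sketch uses a hash-based `OrderedDict` in place of the exponential search tree dictionary, so it does not reflect the stated worst case time bound.

```python
from collections import OrderedDict


class AlphabetReducer:
    """Maps arbitrary stream symbols to identifiers in {0, ..., sigma - 1}."""

    def __init__(self, sigma):
        self.sigma = sigma
        # symbol -> identifier, ordered from least to most recently seen
        self.ids = OrderedDict()

    def map(self, symbol):
        if symbol in self.ids:
            self.ids.move_to_end(symbol)             # now the most recently seen
        elif len(self.ids) < self.sigma:
            self.ids[symbol] = len(self.ids)         # a fresh identifier
        else:
            _, freed = self.ids.popitem(last=False)  # evict least recently seen symbol
            self.ids[symbol] = freed                 # and reuse its identifier
        return self.ids[symbol]
```

The intuition is that an identifier is only reused when sigma other symbols have been seen more recently than its previous owner, so any window containing both the old and the new owner has more than sigma distinct symbols and cannot p-match anyway.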

4 The deterministic matching algorithm

We now describe a deterministic algorithm for parameterised matching in a stream
which requires $O(p + |\Sigma|)$ space, where p is the parameterised period (p-period) of
P. This algorithm is used to search for the prefix $P_0$ in the full matching algorithm.

Theorem 4. Streaming parameterised matching can be solved deterministically in
real-time and $O(p + |\Sigma|)$ space, where p is the p-period of P and the pattern and
text alphabets are of the form $\Sigma = \{0, 1, 2, \ldots, |\Sigma| - 1\}$.

We first briefly summarize the overall approach of [1] which our algorithm
follows. Whenever some T[i] arrives, the overall goal is to calculate the largest $\ell$ such
that $P[0 \ldots \ell - 1]$ p-matches $T[i - \ell + 1 \ldots i]$. A p-match occurs iff $\ell = m$. When a new
text character T[i + 1] arrives the algorithm compares pred(T)[i + 1] to $\mathrm{pred}(P)[\ell]$
to determine whether $P[0 \ldots \ell]$ p-matches $T[i - \ell + 1 \ldots i + 1]$ in O(1) time. If there
is a p-match, we continue with the next text character. If there is not, we shift
the pattern prefix $P[0 \ldots \ell - 1]$ along by its p-period $p_\ell$ so that it is aligned with
$T[i - \ell + p_\ell + 1 \ldots i]$. This is the next candidate for a p-match. In the original
algorithm, the p-periods of all prefixes are stored in an m-length array called a prefix
table.
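
The approach summarised above can be sketched as a KMP-style matcher over predecessor values. This is our own minimal illustration of the idea with a full m-length prefix table (stored here as border lengths, so the p-period of a prefix of length l is l minus its table entry); the function names are hypothetical and the sketch does not include the space reductions or the de-amortisation developed in this section.

```python
def pred(s):
    last_seen, out = {}, []
    for i, symbol in enumerate(s):
        out.append(i - last_seen[symbol] if symbol in last_seen else 0)
        last_seen[symbol] = i
    return out


def _fits(value, length, target):
    """Does a predecessor value extend an aligned p-match of the given length?

    A value pointing back beyond the aligned window counts as 0.
    """
    clipped = value if value <= length else 0
    return clipped == target


def border_table(pp):
    """fail[l]: length of the longest proper prefix of P[:l] p-matching a suffix of P[:l]."""
    m = len(pp)
    fail = [0] * (m + 1)
    k = 0
    for l in range(1, m):
        while k > 0 and not _fits(pp[l], k, pp[k]):
            k = fail[k]
        k += 1  # after the loop the extension always fits (trivially so when k == 0)
        fail[l + 1] = k
    return fail


def p_match_positions(text, pattern):
    """Start positions of all p-matches of a non-empty pattern, scanning left to right."""
    pp = pred(pattern)
    m = len(pp)
    fail = border_table(pp)
    matches, l, last_seen = [], 0, {}
    for i, symbol in enumerate(text):
        t = i - last_seen[symbol] if symbol in last_seen else 0
        last_seen[symbol] = i
        # Shift the pattern while the current prefix cannot be extended.
        while l > 0 and (l == m or not _fits(t, l, pp[l])):
            l = fail[l]
        l += 1
        if l == m:
            matches.append(i - m + 1)
    return matches
```

On the introductory example, the pattern aba is reported at positions 0, 1 and 2 of xyxyx and only at position 2 of xxxyx.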

The main hurdle we must tackle is to store both a prefix table suitable for
parameterised matching as well as an encoding of the pattern in only $O(p + |\Sigma|)$
space, while still allowing efficient access to both. It is well-known that any string P
can be stored in space proportional to its exact period. In Lemma 2, which follows
from Fact 2, we show an analogous result for pred(P).

Fact 2. For any $j \in [p]$ there is a constant $k_j$ such that $\mathrm{pred}(P)[j + kp]$ is zero for
$k < k_j$, and $c_j$ for $k \ge k_j$, where $c_j \ge 1$ is a constant that depends on j.

Lemma 2. The predecessor string pred(P) can be stored in O(p) space, where p
is the p-period of P. Further, for any $j \in [m]$ we can obtain pred(P)[j] from this
representation in O(1) time.
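
A compact representation in the spirit of Lemma 2 can be sketched by storing, for each residue modulo the p-period, the first multiple at which the predecessor value becomes non-zero together with that value. The helper names below are ours, the p-period is taken as an input that is assumed to be correct, and the construction scans the whole predecessor string once, so only the resulting table (not its construction) uses O(p) space.

```python
def compress_pred(pp, p):
    """For each residue j < p, store (k_j, c_j) as described in Fact 2.

    pp is the predecessor string of the pattern and p its p-period.
    k_j is None when the value stays zero for every index congruent to j.
    """
    table = []
    for j in range(p):
        k_j, c_j = None, 0
        for k, idx in enumerate(range(j, len(pp), p)):
            if pp[idx] != 0:
                k_j, c_j = k, pp[idx]
                break
        table.append((k_j, c_j))
    return table


def lookup(table, p, j):
    """Recover the predecessor value at index j from the compressed table."""
    k_j, c_j = table[j % p]
    return c_j if k_j is not None and j // p >= k_j else 0
```

For the pattern aabaab, whose predecessor string is 0, 1, 0, 2, 1, 3 and whose p-period is 3, the table has three entries and `lookup` reproduces every value.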

We now show how to store the parameterised prefix table in only O(p) space, in
contrast to O(m) space which a standard prefix table would require. The p-period
$p_\ell$ of $P[0 \ldots \ell]$ is, as a function of $\ell$, non-decreasing in $\ell$. This property enables us to
run-length encode the prefix table and store it as a doubly linked list with at most
p elements, hence using only O(p) space. Each element corresponds to an interval
of prefix lengths with the same p-period, and the elements are linked together in
increasing order (of the common p-period). This representation does not
allow O(1) time access to the p-period of an arbitrary prefix; for our purposes, however,
it will suffice. To accelerate computation we also store a second linked list of the
indices of the first occurrences of each symbol in P in ascending order, i.e. every j
such that pred(P)[j] = 0. This uses $O(|\Sigma|)$ space.

There is a crucial second advantage to compressing the prefix table which is that
it allows us to upper bound the number of prefixes of P we need to inspect when
a mismatch occurs. When a mismatch occurs in our algorithm, we repeatedly shift
the pattern until a p-match between a text suffix and pattern prefix occurs. Naively
it seems that we might have to check many prefixes within the same run. However,
by Lemma 3 (which follows from Fact 2) we are assured that if some prefix does
not p-match, every prefix in the same run with $\mathrm{pred}(P)[j] \neq 0$ will also mismatch
(except possibly the longest). Therefore we can skip inspecting these prefixes. By
keeping pointers into both linked lists, it is straightforward to find the next prefix
to check in O(1) time. Whenever we perform a pattern shift we move at least one
of the pointers to the left. Therefore the total number of pattern shifts inspected
while processing T[i] is at most $O(|\Sigma| + p)$. As each pointer only moves to the right
by at most one when each T[i] arrives, an amortised time complexity of O(1) per
character follows. The space usage is $O(|\Sigma| + p)$ as claimed, dominated by the size
of the linked lists.

Fig. 3. A typical state of the deterministic real-time algorithm.

Lemma 3. Let j be such that $p_j = p_{j+1}$. Then $\mathrm{pred}(P)[j - p_j] \in \{\mathrm{pred}(P)[j], 0\}$.

We now briefly discuss how to de-amortise our solution by applying Galil's KMP
de-amortisation argument [10]. The main idea is to restrict the algorithm to shift
the pattern at most twice when each text character arrives giving a constant time
algorithm. If we have not finished processing T[i] by this point we accept T[i + 1]
but place it on the end of a buffer, output 'no match' and continue processing
T[i]. The key property is that the number of text arrivals until the next p-match
occurs is at least the length of the buffer (see Fig. 3). As we shift the pattern up to
twice during each arrival we always clear the buffer before (or as) the next p-match
occurs. Further, the size of the buffer is always $O(|\Sigma| + p)$. This follows from the
observation above that the number of pattern shifts required to process a single
text character is $O(|\Sigma| + p)$. This therefore establishes Theorem 4. Combining this
result with Lemma 1 gives us Theorem 2. The full details are omitted due to space
constraints.

5 Supporting lemmas: running time, space bounds and 
correctness 

We begin by considering subprocess P2a which maintains the required data 
structures to efficiently compute the required Δ_ℓ values. It follows directly from the 
algorithm description that P2a takes O(1) time per character and uses O(|Σ| log m) 
space to store the queues D_ℓ for all ℓ. Further in Lemma 4 we show that the queue 
𝒟 uses only O(|Σ|) space. In simple terms this is because the same symbol can only 
be inserted into 𝒟 at most every δ = |Σ| log m arriving symbols. Coupled with the 
fact that we remove one element every s ∈ O(log m) arrivals, this ensures that we 
only ever maintain one element for each symbol in the alphabet. 

Lemma 4. After any text arrival, for any σ ∈ Σ there is at most one element 
(i', pred(T)[i']) in 𝒟 with T[i'] = σ and therefore 𝒟 requires only O(|Σ|) space. 

We now turn our attention to subprocess P2b which finds matches with each 
of the prefixes P_1, P_2, .... We begin in Lemma 5 by giving an explicit expression 
for Δ_ℓ(i') in terms of D_ℓ. The lemma follows from the fact that any text index, k, 
which contributes to some Δ_ℓ(i') has pred(T)[k] > m_{ℓ−1} and that there can only 
be O(|Σ|) such indices in any O(m_{ℓ−1}) length window of T. 



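Lemma 5. When T[i] arrives, for any i' with i − 3δ < i' + m_ℓ ≤ i − δ, 

$$\Delta_\ell(i') = \sum_{k=i'+m_{\ell-1}}^{i'+m_\ell-1} d(k)\cdot r^{k} \bmod p
\quad\text{where}\quad
d(k) = \begin{cases} h & \text{if } (k,h) \in D_\ell \text{ and } h > k - i' \\ 0 & \text{otherwise.} \end{cases}$$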
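
The sum in Lemma 5 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the entries (k, h) of D_ℓ are held in a dictionary `D_level` mapping k to h, and that `r` and `p` are the fingerprint base and prime modulus; the function name and argument names are hypothetical and the loop over a range is a simplification of how the queue would actually be scanned.

```python
def delta(D_level, i_prime, m_prev, m_cur, r, p):
    """Sum d(k) * r^k mod p over k = i' + m_prev, ..., i' + m_cur - 1.

    D_level is an assumed dict {k: h}; m_prev and m_cur stand for
    m_{l-1} and m_l in the text.
    """
    total = 0
    for k in range(i_prime + m_prev, i_prime + m_cur):
        h = D_level.get(k)
        # d(k) = h only when (k, h) is stored and h > k - i'; otherwise 0.
        if h is not None and h > k - i_prime:
            total = (total + h * pow(r, k, p)) % p
    return total


# Small hand-checkable instance: only k = 3 and k = 5 contribute.
print(delta({3: 5, 4: 1, 5: 7}, i_prime=2, m_prev=1, m_cur=4, r=3, p=101))
```

With these toy values the contributing terms are 5·3^3 and 7·3^5, since the entry at k = 4 has h = 1, which does not exceed k − i' = 2.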



Consider the point in subprocess P2b when computation on an element 
(i', Φ_{ℓ−1}(i')) in M_{ℓ−1} has just begun. This occurs when the first index 
i > i' + m_ℓ + δ arrives such that ℓ = 1 + (i mod s). To compute Φ_ℓ(i'), P2b requires 
Φ_{ℓ−1}(i') and Ψ_ℓ(i') (which are readily available) and Δ_ℓ(i'). To compute Δ_ℓ(i'), 
P2b inspects a constant number of elements of D_ℓ once in every s text arrivals. 
Therefore as |D_ℓ| ≤ 12|Σ|, all elements can be inspected before index i' + m_ℓ + 2δ 
arrives, as required. From Lemma 5 we have that Δ_ℓ(i') and hence Φ_ℓ(i') is computed 
correctly. Further, as i' + m_ℓ + 2δ ≤ i' + m_{ℓ+1}, we have that P2b adds (i', Φ_ℓ(i')) 
to the queue M_ℓ (if a match occurs) before it is needed by level ℓ + 1. As the 
p-period of P_ℓ is more than 3δ, any two matches in M_ℓ are at least 3δ positions 
apart so there is no risk that subprocess P2b will overlook an element of M_ℓ while 
processing another. 

We now examine the space usage for the M_ℓ queues. For all ℓ, whenever a match 
with P_ℓ at some position i' of the text is found, the pair (i', Φ_ℓ(i')) is inserted into 
the queue M_ℓ. This pair will stay in M_ℓ until it is removed by the subprocess P2b 
for the purpose of determining whether i' is also a matching position with P_{ℓ+1}. 
Despite the delays with which our algorithm performs certain actions, it is not 
difficult to verify that the pair is always inserted into M_ℓ before it is needed. 
However, it could take up to 2m_ℓ + 2δ text symbol arrivals until the pair is removed. 
In a window of this length there could be up to O(m_ℓ/p_ℓ) matching positions with 
P_ℓ, where p_ℓ is the p-period of P_ℓ. When p_ℓ is relatively small, explicitly storing this 
number of pairs would require much more than O(|Σ|) space. As we will see in the 
next lemma, there is a succinct data structure that allows us to store every pair in 
M_ℓ in only O(|Σ|) space. 

Lemma 6. For every ℓ, there is a data structure for M_ℓ that uses only O(|Σ|) 
space. Both retrieving and inserting a pair take O(1) time. 

The proof of this lemma is arguably the most involved part of the paper, taking 
advantage of the properties of parameterised matches. Unlike exact matching, 
parameterised matches that are not too far apart can occur at an arbitrary 
distance from each other, prohibiting a space efficient representation of their locations. 
Fortunately however, it turns out that only O(|Σ|) matches show this arbitrary 
behaviour. Over a certain window of frequent matches, only the first O(|Σ|) matches 
appear random, after which subsequent matches are evenly spread out with the 
same distance apart. We can therefore store the first matches explicitly and encode 
the other matches with an arithmetic progression. Further, the fingerprints that are 
stored together with the matching positions of the arithmetic progression submit 
to a regular pattern that we can represent succinctly in O(|Σ|) space. 

We have shown that if all the fingerprint comparisons are correct (free from false 
positives) then our algorithm outputs exactly the locations where each P_ℓ p-matches 
T at most 3δ characters after they occur. As stated in Section 2.1 the probability 
of at least one false positive occurring can be made as small as 1/n^c for any constant 
c. The overall space complexity is O(|Σ| log m) and the time complexity is O(1) per 
character (after the alphabet reduction) as claimed. Coupled with the discussions 
in Section 3.3 regarding matching the full pattern and Section 3.4 regarding 
the alphabet reduction, this completes the proof of Theorem 1. 
A Proof of Theorem [H] 



Proof (Proof of Theorem\^. Consider first a pattern where all symbols are dis- 
tinct, e.g. P = 123456. Now let us assume Alice would like to send a bit-string to 
Bob. She can encode the bit-string as an instance of the parameterised matching 
problem in the following way. As an example, assume the bit-string is 01011. She 
first creates the first half of a text stream aBcDE where we choose capitals to 
correspond to 1 and lower case symbols to correspond to 0 from the original bit-string. 
She starts the matching algorithm and runs it until the pattern and the first half 
of the text have been processed and then sends a snapshot of the memory to Bob. 
Bob then continues with the second half of the text which is fixed to be the sorted 
lower case symbols, in this case abcde. Where Bob finds a parameterised match 
he outputs a 1 and where he does not, he outputs a 0. Thus Alice's bit-string is 
reproduced by Bob. In general, if we restrict the alphabet size of the pattern to be 
|Σ| then Alice can similarly encode a bit-string of length |Σ| − 1, and successfully 
transmit it to Bob, giving us an Ω(|Σ|) bit lower bound on the space requirements of 
any streaming algorithm. If randomisation is not allowed, the lower bound increases 
to Ω(|Σ| + p) bits of space. Here p is the parameterised period of the pattern. This 
bound follows by a similar argument by devising a one-to-one encoding of bit-strings 
of length O(p) into P[0 ... p − 1]. The key difference is that with a deterministic 
algorithm, Bob can enumerate all possible m-length texts to recover Alice's bit-string 
from P. 
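
The encoding in the first part of the proof can be replayed with a short Python sketch. It is an illustration only: here Bob simply re-runs a naive p-match check over the whole text rather than resuming from a memory snapshot, and the helper names (`normalise`, `p_match`, `alice_first_half`, `bob_decode`) are hypothetical.

```python
import string


def normalise(s):
    """Relabel symbols by order of first appearance; two strings p-match
    exactly when their normalised forms are equal."""
    first = {}
    return [first.setdefault(c, len(first)) for c in s]


def p_match(pattern, text, i):
    window = text[i:i + len(pattern)]
    return len(window) == len(pattern) and normalise(window) == normalise(pattern)


def alice_first_half(bits):
    # Bit 1 becomes an upper case letter, bit 0 a lower case letter.
    letters = string.ascii_lowercase
    return "".join(letters[j].upper() if b == "1" else letters[j]
                   for j, b in enumerate(bits))


def bob_decode(first_half):
    n = len(first_half)
    second_half = string.ascii_lowercase[:n]   # sorted lower case symbols
    pattern = list(range(n + 1))               # n + 1 distinct symbols, like P = 123456
    text = first_half + second_half
    return "".join("1" if p_match(pattern, text, i) else "0" for i in range(n))


print(alice_first_half("01011"))   # aBcDE
print(bob_decode("aBcDE"))         # 01011
```

A window starting at position i contains a repeated symbol exactly when the i-th letter of the first half is lower case, which is why the match pattern reproduces the bit-string.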

B Proofs omitted from Section 3.4 

Proof (Proof of Lemma 1). Let |Σ| be the number of distinct symbols in P. Let 
Σ_T denote the text alphabet of the (unfiltered) stream. Let the strings S and S_filt 
denote the last m characters of the unfiltered and filtered stream, respectively. Let 
Σ_last ⊆ Σ_T denote the up to |Σ| + 1 last distinct symbols in S, hence |Σ_last| is 
never more than |Σ| + 1. Let 𝒯 be a dynamic perfect hash function on Σ_last such 
that a symbol in Σ_T can be looked up, deleted and added to 𝒯 in expected O(1) 
time [3]. Every symbol that arrives in the stream is associated with its "arrival 
time", which is an integer that increases by one for every new symbol arriving in 
the stream. Let ℒ be an ordered list of the symbols in Σ_last (together with their 
most recent arrival time) such that ℒ is ordered according to the most recent arrival 
time. For example, 

(d,25), (b,33), (g,58), (e,102) (1) 

means that the symbols b, d, e and g are the last four distinct symbols that appear 
in S (for this example, |Σ| + 1 ≥ 4), where the last e arrived at time 102, the last 
g arrived at time 58, and so on. 

By using appropriate pointers between elements of the hash table 𝒯 and elements 
of ℒ (which could be implemented as a linked list), we can maintain 𝒯 and ℒ in 
time O(1) per arriving symbol. To see this, take the example in Equation (1) and 
consider the arrival of a new symbol x at time 103 (following the last symbol e). 
First we look up x in 𝒯 and if x already exists in Σ_last, move it to the right end of 
ℒ, by deleting and inserting where needed and update the element to (x, 103). Also 
check that the leftmost element of ℒ is not a symbol that has been pushed outside 
of S when x arrived. We use its arrival time to determine this and remove that 
element accordingly. If the arriving symbol x does not already exist in Σ_last, then 
we add (x, 103) to the right end of ℒ. To ensure that ℒ does not contain more than 
|Σ| + 1 elements, we remove the leftmost element of ℒ if necessary. We also remove 
the leftmost symbol if it has been pushed outside of S. The hash table 𝒯 is of course 
updated accordingly as well. 

Let Σ_filt = {0, ..., |Σ|} denote the symbols outputted by the filter. We augment 
the elements of ℒ to maintain a mapping ℳ from the symbols in Σ_last to distinct 
symbols in Σ_filt as follows. Whenever a new symbol is added to Σ_last, map it to 
an unused symbol in Σ_filt. If no such symbol exists, then use the symbol that is 
associated with the symbol of Σ_last that is to be removed from Σ_last (note that 
|Σ_last| ≤ |Σ_filt|). The mapping ℳ specifies the filtered stream: when a symbol x 
arrives, the filter outputs ℳ(x). Finding ℳ(x) and updating 𝒯 is done in O(1) 
time per arriving character, and both the hash table 𝒯 and the list ℒ can be stored in 
O(|Σ|) space. 
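
The filter described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the construction itself: a single `OrderedDict` stands in for both the hash table and the linked list, the function name `make_filter` and its parameters `sigma` (for |Σ|) and `m` are hypothetical, and `OrderedDict` gives expected rather than worst-case constant time.

```python
from collections import OrderedDict


def make_filter(sigma, m):
    """Return a function mapping each arriving symbol to a symbol in {0, ..., sigma}."""
    last = OrderedDict()                 # symbol -> (arrival time, output symbol), oldest first
    free = list(range(sigma, -1, -1))    # unused output symbols; pop() yields 0 first
    clock = 0

    def push(x):
        nonlocal clock
        # Remove the leftmost entry if its last occurrence left the m-length window.
        if last:
            oldest, (t, out) = next(iter(last.items()))
            if t <= clock - m:
                del last[oldest]
                free.append(out)
        if x in last:
            _, out = last.pop(x)                        # re-inserted at the right end below
        elif len(last) == sigma + 1:
            _, (_, out) = last.popitem(last=False)      # evict leftmost, reuse its output symbol
        else:
            out = free.pop()
        last[x] = (clock, out)
        clock += 1
        return out

    return push


f = make_filter(sigma=2, m=4)
print([f(c) for c in "abab"])   # [0, 1, 0, 1]
```

When the window holds at most |Σ| distinct symbols no eviction by size occurs, so the mapping is one-to-one on the window, as the remainder of the proof requires.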

It remains to show that the filtered stream does not induce any false matches or 
miss a potential match. Suppose first that the number of distinct symbols in S is 
|Σ| or fewer. That is, Σ_last contains all distinct symbols in S. Every symbol x in S has 
been replaced by a unique symbol in Σ_filt and the construction of the filter ensures 
that the mapping is one-to-one. Thus, pred(S_filt) = pred(S). Suppose second that 
the number of distinct symbols in S is |Σ| + 1 or more. That is, |Σ_last| = |Σ| + 1 
and therefore S_filt contains |Σ| + 1 distinct symbols. Thus, pred(S_filt) cannot equal 
pred(P). 

C Proofs omitted from Section 4 

We first prove two supporting lemmas. 

Proof (Proof of Fact 5). Let p be the period of P. We prove the statement by 
contradiction. Suppose, for some j and k, that i = j + kp is a position such that 
pred(P)[i] = c ≥ 1 and pred(P)[i + p] = c' ≠ c. Consider Figure 3 for a concrete 
example, where p = 5, i = 12, pred(P)[12] = c = 4 and pred(P)[12 + 5] = c' = 3. 
Since p is a period of P, we have that 

pred(P[p ... m − 1]) = pred(P[0 ... m − 1 − p]) . 

Consider the alignment of positions i + p and i (positions 17 and 12 in Figure 3). 
We have that pred(P[p ... m − 1])[i] is either c' or 0. In either case, it is certainly 
not pred(P[0 ... m − 1 − p])[i] which is c. Thus, p cannot be a period of P. 

Proof (Proof of Lemma 2). By Fact 5 we can encode pred(P) by storing the two 
values k_j and c_j for each j ∈ [p]. This takes O(p) space. The value pred(P)[i] is 
0 if i < k_(i mod p), otherwise it is c_(i mod p). 
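
The encoding by the values k_j and c_j can be tried out with a minimal Python sketch. It assumes that pred(P)[i] is the distance to the previous occurrence of P[i] (0 if none) and computes the parameterised period by brute force; the function names are hypothetical, and correctness of `decode` relies on the structure guaranteed by Fact 5.

```python
def pred(s):
    """pred(s)[i] = distance to the previous occurrence of s[i], or 0 if none."""
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


def p_period(s):
    """Smallest p >= 1 such that s[p:] parameterise matches s[:len(s) - p]."""
    n = len(s)
    return next(p for p in range(1, n + 1) if pred(s[p:]) == pred(s[:n - p]))


def encode(s):
    pr, p = pred(s), p_period(s)
    k, c = [len(s)] * p, [0] * p
    # Scan right to left so that k[j] ends at the first non-zero entry of class j.
    for i in range(len(s) - 1, -1, -1):
        if pr[i] != 0:
            k[i % p], c[i % p] = i, pr[i]
    return p, k, c


def decode(p, k, c, m):
    return [0 if i < k[i % p] else c[i % p] for i in range(m)]


p, k, c = encode("aabaab")
print(p, k, c)                          # 3 [3, 1, 5] [2, 1, 3]
print(decode(p, k, c, 6) == pred("aabaab"))
```

Only the 2p values in `k` and `c` are needed to recover any entry of pred(P) in constant time.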
D Proofs omitted from Section 5 

Proof (Proof of Lemma 4). Proof by induction. After T[0] is processed, 𝒟 is 
empty, providing a base case. Let T[i] = σ be the most recently arrived character. 
We assume by the strong inductive hypothesis that for all i' < i, 𝒟 contains at most 
one occurrence of each symbol. If pred(T)[i] < m_0 then i is not added to 𝒟 and we 
are done. Therefore, we assume that pred(T)[i] > m_0. By the inductive hypothesis, 
after T[i] arrives, 𝒟 contains at most one occurrence of each symbol except possibly 
σ. We now show that σ occurs in 𝒟 exactly once after T[i] = σ arrives. 

Consider the state of 𝒟 after T[i − pred(T)[i]], the last σ, arrived. At that 
point, 𝒟 contained at most |Σ| entries. After s arrivals after T[i − pred(T)[i]], the 
occurrence of σ will have moved up one position in 𝒟. After |Σ|s arrivals, the σ 
will have been removed from the queue. However, m_0 must be at least the p-period 
of P[0 ... m_0 − 1] which is greater than 3δ. Therefore, as m_0 > 3δ > |Σ|s it is 
removed before T[i] = σ arrives. 

Proof (Sketch proof of Lemma 5). Starting from the definition of Δ_ℓ(i') in terms 
of Φ_ℓ(i') and rearranging, it follows that 

$$\Delta_\ell(i') = \sum_{j=m_{\ell-1}}^{m_\ell-1} d'(i'+j)\cdot r^{i'+j} \bmod p .$$

Here d'(i'+j) = pred(T)[i'+j] if pred(T)[i'+j] > j and 0 otherwise. As j ≥ m_{ℓ−1} 
every such (i'+j) will be inserted into D_ℓ. The claim follows by observing that by 
the algorithm description these indices will be present in D_ℓ while 
i − 3δ < i' + m_ℓ ≤ i − δ. 

D.1 Storing M_ℓ for all ℓ 

We now introduce some notation, after which we give a few lemmas that will be 
useful for the proof of Lemma 12 which in turn will allow us to prove Lemma 6. 

We write P ≙ T to denote a parameterised match between P and T. An alphabet 
can be augmented with the special symbol "★" which is used to represent a so called 
wildcard symbol or a "don't care" symbol. In terms of matching, the symbol "★" is 
allowed to match any other symbol of the alphabet without causing a mismatch. 
For example, the string ab★a matches both abba and abca. In addition to the 
predecessor string pred(T) of T, we define the wildcard predecessor string of T, 
denoted pred*(T), to be identical to pred(T) only with the zeros of pred(T) being 
replaced by the symbol "★". Thus, if pred(T) = 0102201 then pred*(T) = ★1★22★1. 
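
As a small illustration, the two strings can be computed in Python as below. The sketch assumes that pred(T)[i] is the distance to the previous occurrence of T[i], and writes the wildcard as the ASCII character `*`; the text `aababcc` is a hypothetical input chosen because its predecessor string is the one in the example above.

```python
def pred(t):
    """pred(t)[i] = distance to the previous occurrence of t[i], or 0 if none."""
    last, out = {}, []
    for i, c in enumerate(t):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


def pred_star(t):
    """Same as pred(t) but with every zero replaced by the wildcard '*'."""
    return ["*" if d == 0 else d for d in pred(t)]


print(pred("aababcc"))        # [0, 1, 0, 2, 2, 0, 1]
print(pred_star("aababcc"))   # ['*', 1, '*', 2, 2, '*', 1]
```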

Lemma 7. Suppose S is a string and pred(S)[i ... i + m] = pred(S)[j ... j + m] 
for some i, j and m. Then pred(S[i ... i + m]) = pred(S[j ... j + m]). 

Proof. The statement of the lemma follows directly from the observation that 
pred(S[i ... i + m]) is uniquely obtained from pred(S)[i ... i + m] by setting 
every position d for which (pred(S)[i ... i + m])[d] > d to zero. 
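
The observation used in this proof is easy to check mechanically. The following Python sketch, with the hypothetical helper name `restrict`, zeroes every entry that points outside the slice and compares the result with the predecessor string of the substring itself.

```python
def pred(s):
    """pred(s)[i] = distance to the previous occurrence of s[i], or 0 if none."""
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


def restrict(pred_slice):
    """Set every position d whose value exceeds d to zero."""
    return [0 if v > d else v for d, v in enumerate(pred_slice)]


s = "abcabcab"
i, j = 2, 6
print(pred(s)[i:j + 1])              # slice of pred(S)
print(restrict(pred(s)[i:j + 1]))    # equals pred of the substring
print(pred(s[i:j + 1]))
```

Because `restrict` depends only on the slice of pred(S), two equal slices necessarily yield equal predecessor strings of the corresponding substrings.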

Lemma 8. Suppose T is a string of length n and P is a string of length m ≤ 
n. The string P parameterise matches T at position i if and only if pred*(P) = 
pred(T)[i ... i + m − 1] and the number of distinct symbols in T[i ... i + m − 1] equals 
the number of distinct symbols in P. 

Proof. Let R = pred(T)[i ... i + m − 1]. We first show that if P parameterise 
matches T[i ... i + m − 1] then pred*(P) = R. Let R' = pred(T[i ... i + m − 1]) and 
note that R and R' differ only at positions j where R[j] > j and R'[j] = 0 (the 
previous occurrence of symbol T[i + j] is before index i). Since P parameterise 
matches T[i ... i + m − 1], we have that pred(P) = R', hence pred*(P) = R. Further, 
for the same reason, the number of distinct symbols in T[i ... i + m − 1] must equal 
the number of distinct symbols in P. 

Now suppose that pred*(P) = R and suppose n_symbs is the number of distinct 
symbols in T[i ... i + m − 1], which equals the number of distinct symbols in P. 
We will show that P parameterise matches T[i ... i + m − 1]. Since pred*(P) = R we 
also have that pred*(P) = R'. The number of zeros in R' is n_symbs and the number 
of wildcards "★" in pred*(P) is also n_symbs. Hence pred(P) = R', which implies 
that P parameterise matches T[i ... i + m − 1]. 
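
The characterisation in Lemma 8 can be compared against a direct check in Python. The sketch below assumes that two strings parameterise match exactly when their predecessor strings are equal; the function names `p_match_naive` and `p_match_lemma8` are hypothetical.

```python
def pred(s):
    """pred(s)[i] = distance to the previous occurrence of s[i], or 0 if none."""
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


def p_match_naive(p, t, i):
    """Direct check: the window and the pattern have equal predecessor strings."""
    window = t[i:i + len(p)]
    return len(window) == len(p) and pred(window) == pred(p)


def p_match_lemma8(p, t, i):
    """Check via pred*(P) against pred(T)[i..i+m-1] and the distinct-symbol count."""
    m = len(p)
    window = t[i:i + m]
    if len(window) != m:
        return False
    r = pred(t)[i:i + m]
    # A zero in pred(P) acts as a wildcard; every other entry must agree with R.
    wildcard_ok = all(a == 0 or a == b for a, b in zip(pred(p), r))
    return wildcard_ok and len(set(window)) == len(set(p))


print(p_match_naive("xyx", "abcbc", 1), p_match_lemma8("xyx", "abcbc", 1))
```

The second function never recomputes the predecessor string of the window, which is the point of the lemma: a single predecessor string of the whole text suffices.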

Lemma 9. Suppose S is a string of length n and α is a number such that the prefix 
S[0 ... α] contains all distinct symbols in S. If p is a parameterised period of S then 
p is an exact period of pred(S)[α + 1 ... n − 1] (not necessarily the shortest exact 
period). 

Proof. Since p is a period of S, we have that S[0 ... n − 1 − p] parameterise matches 
S at position p. From Lemma 8 we have that 

pred*(S[0 ... n − 1 − p]) = pred(S)[p ... n − 1] , 

where the left-hand side is identical to pred*(S)[0 ... n − 1 − p]. The equation must 
hold for any corresponding substrings of the left-hand side and right-hand side. In 
particular, if we ignore the first α + 1 characters, we have 

pred*(S)[α + 1 ... n − 1 − p] = pred(S)[p + α + 1 ... n − 1] . 

The left-hand side does not contain any wildcards since all distinct symbols in S 
occur in S[0 ... α]. We can therefore rewrite the equation above as 

pred(S)[α + 1 ... n − 1 − p] = pred(S)[p + α + 1 ... n − 1] . 

From the definition of period, it follows that p is an exact period of 
pred(S)[α + 1 ... n − 1]. 

Lemma 10. Suppose S is a string, and S_pref is a prefix of S and S_suf is a suffix of 
S. If p is a period of both S_pref and S_suf, and the length of the overlap of S_pref and 
S_suf is at least p, then p is a period of S. 

Proof. Consider S aligned with itself shifted p steps in Figure 5. The shaded area 
is the overlap of S_pref and S_suf. Since p is a period of both S_pref and S_suf, there 
is a match at every position of the alignment, with the possible exception of the 
positions marked by the thick vertical line segments. If the length of the overlap is 
at least p, no such positions exist. 
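
Lemma 10 concerns exact periods and can be sanity-checked by brute force. In the Python sketch below the names `has_period` and `lemma10_holds` are hypothetical; `a` and `b` denote the lengths of the prefix and the suffix, so the overlap has length a + b − |S|.

```python
def has_period(s, p):
    """True if s[i] == s[i + p] for every valid i (p is an exact period of s)."""
    return all(s[i] == s[i + p] for i in range(len(s) - p))


def lemma10_holds(s, a, b, p):
    """Check the implication of Lemma 10 for prefix s[:a] and suffix s[len(s) - b:]."""
    n = len(s)
    overlap = a + b - n
    if overlap >= p and has_period(s[:a], p) and has_period(s[n - b:], p):
        return has_period(s, p)
    return True   # hypothesis not met, nothing to check


# The prefix "abab" and suffix "abab" of "ababab" overlap in 2 = p positions.
print(lemma10_holds("ababab", a=4, b=4, p=2))
```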

Lemma 11. Suppose S is a string of length n over the alphabet Σ. Let α be the 
smallest number such that all distinct symbols in S occur in the prefix S[0 ... α]. 
Then the parameterised period of S is at least α/|Σ|. 



Fig. 6. Diagram supporting the proof of Lemma 11. 



Proof. Let p be the parameterised period of S. From the definition of α, we have 
that the symbol S[α] does not occur before position α. Consider the diagram in 
Figure 6 where S[α] = e. Since S parameterise matches itself when shifted p steps, 
we have that the symbol S[α − p] cannot occur before position α − p as this would 
require S[α] to occur before position α. We repeat this argument by shifting again, 
and conclude that the number of symbols |Σ| ≥ α/p. Thus, p ≥ α/|Σ|. 
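
The bound of Lemma 11 can be checked exhaustively on short strings. The Python sketch below computes the parameterised period by brute force and takes |Σ| to be the number of distinct symbols occurring in S; the function names `p_period` and `alpha` are hypothetical.

```python
from itertools import product


def pred(s):
    """pred(s)[i] = distance to the previous occurrence of s[i], or 0 if none."""
    last, out = {}, []
    for i, c in enumerate(s):
        out.append(i - last[c] if c in last else 0)
        last[c] = i
    return out


def p_period(s):
    """Smallest p >= 1 such that s[p:] parameterise matches s[:len(s) - p]."""
    n = len(s)
    return next(p for p in range(1, n + 1) if pred(s[p:]) == pred(s[:n - p]))


def alpha(s):
    """Smallest index a such that s[0..a] contains every distinct symbol of s."""
    seen, a = set(), 0
    for i, c in enumerate(s):
        if c not in seen:
            seen.add(c)
            a = i
    return a


# Check p * |Sigma| >= alpha for every string of length 1..6 over {a, b, c}.
ok = all(p_period(s) * len(set(s)) >= alpha(s)
         for n in range(1, 7) for s in product("abc", repeat=n))
print(ok)
```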

Recall that an arithmetic progression is a sequence of numbers such that the 
difference between any two successive numbers in the sequence is constant. We 
can specify an arithmetic progression by its start number, the difference between 
successive numbers and the length of the sequence. For notational convenience, we 
think of an arithmetic progression as a set of numbers (for which there is a very 
succinct representation). In the next lemma we will see that the positions at which 
a string P of length m parameterise matches a longer string T of length 3m/2 can 
be stored in small memory: either a matching position belongs to an arithmetic 
progression or it is one of very few positions that can be listed explicitly. We will 
then show how this fact can be used to store all the required fingerprints in O(|Σ|) 
space (per level). 
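
A minimal Python sketch of this representation follows: a progression is kept as the triple (start, difference, length), and a list of match positions is split into an explicit part and a progression. The class name `Progression`, the helper `compress` and its parameter `n_explicit` are hypothetical, and the example positions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Progression:
    start: int
    diff: int
    length: int

    def __contains__(self, x):
        if self.length == 0:
            return False
        if self.diff == 0:
            return x == self.start
        q, rem = divmod(x - self.start, self.diff)
        return rem == 0 and 0 <= q < self.length

    def positions(self):
        return [self.start + j * self.diff for j in range(self.length)]


def compress(matches, n_explicit):
    """Keep the first n_explicit positions explicitly; the rest must be evenly spaced."""
    explicit, rest = matches[:n_explicit], matches[n_explicit:]
    if not rest:
        return explicit, Progression(0, 0, 0)
    diff = rest[1] - rest[0] if len(rest) > 1 else 0
    if any(b - a != diff for a, b in zip(rest, rest[1:])):
        raise ValueError("remaining positions are not an arithmetic progression")
    return explicit, Progression(rest[0], diff, len(rest))


Y, A = compress([3, 10, 11, 20, 25, 30, 35], n_explicit=3)
print(Y, A)   # [3, 10, 11] Progression(start=20, diff=5, length=4)
```

Whatever the number of positions in the progression, it occupies three integers, so the total space is governed by the size of the explicit part.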

Lemma 12. Let X ⊆ {n − 3m/2, ..., n − 1} be the set of positions at which P 
p-matches within some 3m/2 length suffix of T. There exists a set Y with |Y| ≤ 6|Σ| 
and an arithmetic progression A such that Y ∪ A = X. Further, 

1. all positions in Y are to the left of the leftmost position of A, 

2. the distance between any two (consecutive) positions in A is p, and 

3. pred(T)[(i + m − p) ... (i + m − 1)] is the same for all i ∈ A. 

Proof. Since T is arbitrarily long and we are concerned with matches of the 3m/2 
length suffix of T, we conceptually think of T as an array where all indices have been 
shifted by n − 3m/2. That is, T[−n + 3m/2] is the first character of T, T[0] is the 
first character of the 3m/2 length suffix of T and T[3m/2 − 1] is the last character 
of T. We are now concerned with matches between P and T[0 ... 3m/2 − 1]. We 
may of course translate back to the normal indexing by adding the offset n − 3m/2. 

We may assume that P parameterise matches T at position 0 (i.e., the leftmost 
position of the 3m/2 length suffix of T). Let p be the parameterised period of P. 
Let α be the smallest number such that all distinct symbols in P occur in the prefix 
P[0 ... α]. By Lemma 11 we have that p ≥ α/|Σ|. 

First consider the case α ≥ m/4. This implies that p ≥ m/(4|Σ|) and the total 
number of positions where P can parameterise match T is upper bounded by 

$$\frac{|T|}{p} \le \frac{3m/2}{m/(4|\Sigma|)} = 6|\Sigma| .$$

All these positions can be stored in the set Y. 

Now consider the case α < m/4. Suppose first that p ≥ m/8. The number 
of positions at which P can parameterise match T is then upper bounded by the 
constant 12. We therefore continue with the assumption that p < m/8. As p ≥ 
α/|Σ|, there are at most (α + 1)/(α/|Σ|) ≤ 2|Σ| positions from the set {0, ..., α} 
at which P can parameterise match T. We can store these positions in the set Y. 
Next we will show that the positions from the set {α + 1, ..., 3m/2 − 1} at which 
P parameterise matches T can be represented by the arithmetic progression A. 

[Figure 7: alignment of T and P showing i_right, with positions m/4, m/2, 3m/4, m, 
5m/4 and 3m/2 marked.] 

Since all distinct symbols in P occur in the prefix P[0 ... α] of P, it follows from 
Lemma 9 that p is an exact period of pred(P)[α + 1 ... m − 1] (not necessarily the 
shortest period). We have assumed that P parameterise matches T at position 0, 
so 

pred(P) = pred(T[0 ... m − 1]) = pred(T)[0 ... m − 1] , 

which certainly implies that 

pred(P)[α + 1 ... m − 1] = pred(T)[α + 1 ... m − 1] . 

Thus, p is also an exact period of pred(T)[α + 1 ... m − 1]. 

Let i_right be the rightmost position at which P parameterise matches T. Since 
all distinct symbols in P occur in P[0 ... α] we have that 

pred(P)[α + 1 ... m − 1] = pred(T)[i_right + α + 1 ... i_right + m − 1] , 

hence p is an exact period of pred(T)[i_right + α + 1 ... i_right + m − 1]. 

For the remaining part of this proof it may help to consider the illustrative 
example in Figure 7. If i_right ≤ α then there are no positions to store in the arithmetic 
progression A and we are done. Suppose therefore that i_right > α. We must have 
i_right + m ≤ 3m/2, which with α < m/4 implies that i_right + α < 3m/4. Thus, if we 
let 

S = pred(T)[α + 1 ... i_right + m − 1] , 
S_pref = pred(T)[α + 1 ... m − 1] , 
S_suf = pred(T)[i_right + α + 1 ... i_right + m − 1] , 

where S_pref is a prefix of S and S_suf is a suffix of S, we have that the length of the 
overlap of S_pref and S_suf (shaded section in Figure 7) is 

(m − 1) − (i_right + α + 1) + 1 > m/4 − 1 > 2p − 1 ≥ p. 

We have shown above that both S_pref and S_suf have the exact period p, so by 
Lemma 10 it follows that p is also a period of S. 

Let i be any position from the set {α + 1, ..., i_right − 1} such that P parameterise 
matches T at i. If no such i exists then i_right is the only position to store in the 
arithmetic progression A. We will now show that if some i exists then P must also 
parameterise match T at position i + p. By induction, starting with the smallest i, 
the set {i, (i + p), (i + 2p), ..., i_right} is the set of positions at which P parameterise 
matches T. This set is an arithmetic progression. 

As p is an exact period of S, any two substrings of S that are p positions apart 
must be identical. Thus, 

pred(T)[i ... i + m − 1] = pred(T)[i + p ... i + p + m − 1] . 

By Lemma 7 we have that 

pred(T[i ... i + m − 1]) = pred(T[i + p ... i + p + m − 1]) , 

which implies that P parameterise matches T at position i + p. 

Finally, the two properties in the statement of the lemma follow from the proof. 

□ 

Proof (Proof of Lemma 6). To take advantage of Lemma 12 we first conceptually 
partition the text into overlapping substrings T[k(m_{ℓ−1}/2) ... (k + 3)(m_{ℓ−1}/2) − 1] 
for all k ∈ {0, 1, 2, ...}. Consider the set of p-matches for P_{ℓ−1} that have been 
outputted by level ℓ − 1 but not yet by level ℓ. As m_ℓ/m_{ℓ−1} is 
constant, at any time these p-matches are contained within a constant number of 
distinct partitions. By applying Lemma 12 using P_{ℓ−1} as the pattern and a partition 
as the text we have that the p-matches in any partition form two sets Y and A, 
where A is an arithmetic progression. The set Y contains at most 6|Σ| occurrences. 
By the first property in the statement of Lemma 12 the occurrences in Y all occur 
before the occurrences in A. Therefore, we store { Φ_{ℓ−1}(i') | i' ∈ Y } explicitly 
(in left to right order) along with the text positions i' to which they correspond. 
This requires only O(|Σ|) space. Any additional occurrences which arrive must form 
an arithmetic progression. By the second and third properties in the statement of 
Lemma 12 we can store the set { Φ_{ℓ−1}(i') | i' ∈ A } by storing only 
φ(pred(T)[(i_1 + m_{ℓ−1} − p_{ℓ−1}) ... (i_1 + m_{ℓ−1} − 1)]) and the progression A 
itself, where i_1 is the first position in A and p_{ℓ−1} is the parameterised period of 
P_{ℓ−1}. Any fingerprint in this set can be recovered in constant time as required by 
simple fingerprint arithmetic. We have therefore stored all the required fingerprints 
which occur in a single partition in O(|Σ|) space and as observed we only need to 
consider a constant number of partitions at one time to maintain M_ℓ. 